Train-O-Matic: Large-Scale Supervised Word Sense Disambiguation in Multiple Languages without Manual Training Data
Abstract
Annotating large numbers of sentences with senses is the heaviest requirement of current Word Sense Disambiguation. We present Train-O-Matic, a language-independent method for generating millions of sense-annotated training instances for virtually all meanings of words in a language’s vocabulary. The approach is fully automatic: no human intervention is required and the only type of human knowledge used is a WordNet-like resource. Train-O-Matic achieves consistently state-of-the-art performance across gold standard datasets and languages, while at the same time removing the burden of manual annotation. All the training data is available for research purposes at http://trainomatic.org.
Similar resources
Huge Automatically Extracted Training Sets for Multilingual Word Sense Disambiguation
We release to the community six large-scale sense-annotated datasets in multiple languages to pave the way for supervised multilingual Word Sense Disambiguation. Our datasets cover all the nouns in the English WordNet and their translations in other languages for a total of millions of sense-tagged sentences. Experiments prove that these corpora can be effectively used as training sets for supe...
Word Sense Disambiguation Using Sense Examples Automatically Acquired from a Second Language
We present a novel almost-unsupervised approach to the task of Word Sense Disambiguation (WSD). We build sense examples automatically, using large quantities of Chinese text, and English-Chinese and Chinese-English bilingual dictionaries, taking advantage of the observation that mappings between words and meanings are often different in typologically distant languages. We train a classifier on ...
Word Sense Induction and Disambiguation Rivaling Supervised Methods
Word Sense Disambiguation (WSD) aims to determine the meaning of a word in context, and successful approaches are known to benefit many applications in Natural Language Processing. Although supervised learning has been shown to provide superior WSD performance, current sense-annotated corpora do not contain a sufficient number of instances per word type to train supervised systems for all words...
Unsupervised WSD based on Automatically Retrieved Examples: The Importance of Bias
This paper explores the large-scale acquisition of sense-tagged examples for Word Sense Disambiguation (WSD). We have applied the “WordNet monosemous relatives” method to construct automatically a web corpus that we have used to train disambiguation systems. The corpus-building process has highlighted important factors, such as the distribution of senses (bias). The corpus has been used to trai...
Self-training and co-training in biomedical word sense disambiguation
Word sense disambiguation (WSD) is an intermediate task within information retrieval and information extraction, attempting to select the proper sense of ambiguous words. Due to the scarcity of training data, semi-supervised learning, which profits from seed annotated examples and a large set of unlabeled data, is worth researching. We present preliminary results of two semi-supervised learnin...